Gradient Descent Provably Solves Nonlinear Tomographic Reconstruction

In computed tomography (CT), the forward model consists of a linear Radon transform followed by an exponential nonlinearity based on the attenuation of light according to the Beer–Lambert Law. Conventional reconstruction often involves inverting this nonlinearity as a preprocessing step and then solving a convex inverse problem. However, this nonlinear measurement preprocessing required to use the Radon transform is poorly conditioned in the vicinity of high-density materials, such as metal. This preprocessing makes CT reconstruction methods numerically sensitive and susceptible to artifacts near high-density regions. In this paper, we study a technique where the signal is directly reconstructed from raw measurements through the nonlinear forward model. Though this optimization is nonconvex, we show that gradient descent provably converges to the global optimum at a geometric rate, perfectly reconstructing the underlying signal with a near minimal number of random measurements. We also prove similar results in the under-determined setting where the number of measurements is significantly smaller than the dimension of the signal. This is achieved by enforcing prior structural information about the signal through constraints on the optimization variables. We illustrate the benefits of direct nonlinear CT reconstruction with cone-beam CT experiments on synthetic and real 3D volumes. We show that this approach reduces metal artifacts compared to a commercial reconstruction of a human skull with metal dental crowns.


INTRODUCTION
Computed tomography (CT) is a core imaging modality in modern medicine (Food & Administration, 2023).X-ray CT is used to diagnose a wide array of conditions, plan treatments such as surgery or chemotherapy, and monitor their effectiveness over time.It can image any part of the body, and is widely performed as an outpatient imaging procedure.
CT systems work by rotating an X-ray source and detector around the patient, measuring how much of the emitted X-ray intensity reaches the detector at each angle.Because different tissues absorb X-rays at different rates, each of these measurements records a projection of the patient's internal anatomy along the exposure angle.Algorithms then combine these projection measurements at different angles to recover a 2D or 3D image of the patient.This image is then interpreted by a medical professional (e.g.physician, radiologist, or medical physicist) to help diagnose, monitor, or plan treatment for a disease or injury.
CT scanners in use today typically consider the image reconstruction task as a linear inverse problem, in which the measurements are linear projections of the signal at known angles.Omitting measurement noise, we can write this standard linear measurement model as: ŷi = a T i x, (1.1) where x is a vectorized version of the unknown signal (which commonly lies in 2D or 3D) and a i is a known, nonnegative measurement vector that denotes the weight each entry in the signal contributes to the integral ŷi along measurement ray i. Computed over a set of regularly-spaced ray angles, this is exactly the Radon transform (Radon, 1917).This linear measurement model is quite convenient, as it enables efficient computations using the Fourier slice theorem-which equates linear projections in real-space to evaluation of slices through Fourier space-as well as strong recovery guarantees from compressive sensing (Kak & Slaney, 2001;Foucart & Rauhuti, 2013;Bracewell, 1990).
This linear projection model is accurate for signals of low density, for which the incident X-rays pass through largely unperturbed.
However, consider the common setting in which the signal contains regions of density high enough to occlude X-rays, such as the metal implants used in dental crowns and artificial joints.Such high-density regions produce nonlinear measurements for which the Fourier slice theorem, and standard compressed sensing results, no longer hold.In practice, tomographic reconstruction algorithms that assume a linear projection as the measurement model produce streak-like artifacts around high-density regions, potentially obscuring otherwise measurable and meaningful signal.
To avoid such artifacts, in this paper we consider a nonlinear measurement model, which correctly models signals with arbitrary density.Equation (1.1) then becomes: where the exponential nonlinearity accounts for occlusion and is due to the Beer-Lambert Law.In practice, the partial occlusions captured by eq.(1.2) are commonly incorporated into a linear model by inverting the nonlinearity, converting raw measurements y i from eq. ( 1.2) into processed measurements ŷi = − ln(1 − y i ) for which eq. (1.1) holds.Indeed, this logarithmic preprocessing step is built into commercial CT scanners (Fu et al., 2016), though some additional preprocessing for calibration and denoising is often performed before the logarithm.The logarithm is well-conditioned for y i ≈ 0 but becomes numerically unstable as y i approaches unity, which corresponds to total X-ray absorption.This is particularly problematic for rays that pass through high-density materials, such as metal, as well as for very low-dose CT scans that use fewer X-ray photons.
Instead, we study reconstruction through direct inversion of eq.(1.2) via iterative gradient descent.We optimize a squared loss function over these nonlinear measurements y i , which is optimal for the case of Gaussian measurement noise-though extending this analysis to more realistic noise models is also of interest (Fu et al., 2016).By avoiding the ill-conditioned logarithm of a near-zero measurement, this approach is well-suited to CT reconstruction with low-dose X-rays as well as CT reconstruction with reduced metal artifacts.However, direct reconstruction through eq.(1.2) is more challenging than reconstruction through the linearized eq.(1.1), because the resulting loss function is nonconvex.While the linear inverse problem defined by eq. ( 1.1) can be solved in closed form by methods such as Filtered Back Projection (Radon, 1917;Kak & Slaney, 2001;Natterer, 2001), the nonlinear inverse problem defined by eq. ( 1.2) requires an iterative solution to a nonconvex optimization problem.We show that gradient descent with appropriate stepsize successfully recovers the global optimum of this nonconvex objective, suggesting that direct optimization through eq.(1.2) is a viable and desirable alternative to current methods that use eq.(1.1).
Concretely, we make the following contributions: • We propose a Gaussian model of eq.(1.2) and show that gradient descent converges to the global optimum of this model at a geometric rate, despite the nonconvex formulation.These results hold with a near minimal number of random measurements.To prove this result we utilize and build upon intricate arguments for uniform concentration of empirical processes.
• We also extend our result to a compressive sensing setting to show that a structured signal can be recovered from far fewer measurements than its dimension.In this case prior information about the signal structure is enforced via a convex regularizer; our result holds for any convex regularizer.We show that the required number of measurements is commensurate to an appropriate notion of statistical dimension that captures how well the regularizer enforces the structural assumptions about the signal.For example, for an s-sparse signal and an ℓ 1 regularizer our results require on the order of s log(n/s) measurements, where n is the dimension of the signal.This is the optimal sample complexity even for linear measurements.
• We perform an empirical comparison of reconstruction quality in 3D cone-beam CT on both synthetic and real volumes, where our real dataset consists of a human skull with metal dental crowns.We show that direct reconstruction through eq.(1.2) yields reduction in metal artifacts compared to reconstruction by inverting the nonlinearity into eq.(1.1).

PROBLEM FORMULATION
In practice, the measurement vectors a i are sparse, nonnegative, highly structured, and dependent on the rays i, as only a small subset of signal values in x will contribute to any particular ray.These vectors correspond to the weights in a discretized ray integral (projection) along the i'th measurement ray in the Radon transform (Radon, 1917).We use these real, ray-structured measurement vectors in our synthetic and real-data experiments.
In our theoretical analysis, we make two simplifying alterations to eq. (1.2): (1) we model a i as a standard Gaussian vector, where the Gaussian randomness is an approximation of the randomness in the choice of ray direction, and (2) we wrap the inner product a T i x in a ReLU, to capture the physical reality that the raw integral of density along a ray, and the corresponding sensor measurement, must always be nonnegative.This nonnegativity is implicit in eq.(1.2) because a i represents a ray integral with only nonnegative weights as its entries, and the true density signal x is also nonnegative; in our model eq.( 2.1) we make nonnegativity explicit (subscript + denotes ReLU): (2.1) Here y i ∈ R is a measurement corresponding to ray i, x ∈ R n is the signal we want to recover, and a i ∈ R n are i.i.d.random Gaussian measurement vectors distributed as N (0, I n ).In this paper we consider a least-squares loss of the form which is optimal in the presence of Gaussian measurement noise.However, in our analysis we focus on the noiseless setting.We minimize this loss using subgradient descent starting from z 0 = 0 n , with step size µ t in step t.More specifically, the iterates take the form Here, we use the following subdifferential of f : We also consider a regularized (compressive sensing) setting where the number of measurements m is significantly smaller than the dimension n of the signal.In this case we optimize the augmented loss function via subgradient descent.Here, R(z) is a regularizer enforcing a priori structure about the signal, with regularization weight λ.In our experiments, we use 3D total variation as R, to encourage our reconstructed structure to have sparse gradients in 3D space.
We note that L is a nonconvex objective, so it is not obvious whether or not subgradient descent will reach the global optimum.Do the iterates converge to the correct solution?How many iterations are required?How many measurements?How does the number of measurements depend on the signal structure and the choice of regularizer?In the following sections, we take steps to answer these questions.

GLOBAL CONVERGENCE IN THE UNREGULARIZED SETTING
Our first result shows that in the unregularized setting, direct gradient-based updates converge globally at a geometric rate.We defer the proof of Theorem 1 to Appendix B.
Theorem 1 Consider the problem of reconstructing a signal x ∈ R n from m nonlinear CT measurements of the form y i = 1 − e −(a T i x)+ , where the measurement vectors a i are generated i.i.d.N (0, I n ).We consider a least-squares loss as in eq.(2.2) and run gradient updates of the form and µ t = µe −5∥x∥ ℓ 2 with µ ≤ c 0 for t > 1.Here, erfc is the complementary error function.As long as the number of measurements obeys holds with probability at least 1 − 5e −c3n − 3e − m 2 .Here, c 0 , c 1 , c 2 , and c 3 are fixed positive numerical constants.
Theorem 1 answers some of the key questions from the previous section in the affirmative.Even though the nonlinear CT reconstruction problem is a nonconvex optimization, gradient descent converges to the global optimum, the true signal, at a geometric rate.
Further, the number of required measurements m is on the order of n, the dimension of the signal, which is near-minimal even for a linear forward model.In Theorem 2 we prove global convergence with even fewer measurements in the compressive sensing setting, when some prior knowledge of the signal structure is enforced through a convex regularizer.
We note that the initial step size µ 1 used in Theorem 1 is a function of the signal norm ∥x∥ ℓ2 , which is a priori unknown.However, we briefly describe how this quantity can be estimated from the available measurements.By averaging over the m measurements, we have where g i are i.i.d.standard Gaussian random variables.Since e −gi+∥x∥ ℓ 2 is a 1-Lipschitz function of g i , this quantity concentrates around its mean We can invert this relationship to get a close estimate of ∥x∥ ℓ2 from the average measurement value.
We also note that both the convergence rate and the number of measurements in Theorem 1 are exponentially dependent on ∥x∥ ℓ2 .This is natural because as ∥x∥ ℓ2 increases towards infinity the measurements y i = 1 − e −(a T i x)+ approach the constant value 1 and the corresponding gradient of the loss approaches zero.Intuitively, this corresponds to trying to recover a CT scan of a metal box; if the walls of the box become infinitely absorbing of X-rays, we cannot hope to see inside it.Nonetheless, for real and realistic metal components in our experiments (Section 5) we do find good signal recovery following this approach.

GLOBAL CONVERGENCE IN THE REGULARIZED SETTING
We now turn our attention to the regularized setting.Our measurements again take the form y i = 1 − e −(a T i x)+ for i = 1, 2, . . ., m, where x ∈ R n is the unknown but now a priori "structured" signal.In this case we wish to use many fewer measurements m than the number of variables n, to reduce the X-ray exposure to the patient without sacrificing the resolution of the reconstructed image or volume x.Because the number of equations m is significantly smaller than the number of variables n, there are infinitely many reconstructions obeying the measurement constraints.However, it may still be possible to recover the original signal by exploiting knowledge of its structure.To this aim, let R : R n → R be a regularization function that reflects some notion of "complexity" of the "structured" solution.For the sake of our theoretical analysis we will use the following constrained optimization problem in lieu of eq.(2.3) to recover the signal: We solve this optimization problem using projected gradient updates of the form Here, P K (z) denotes the projection of z ∈ R n onto the constraint set We wish to characterize the rate of convergence of the projected gradient updates eq. ( 4.2) as a function of the number of measurements, the available prior knowledge of the signal structure, and how well the choice of regularizer encodes this prior knowledge.For example, if we know our unknown signal x is approximately sparse, using an ℓ 1 norm for the regularizer is superior to using an ℓ 2 regularizer.To make these connections precise and quantitative, we need a few definitions which we adapt verbatim from Oymak et al. (2017); Oymak & Soltanolkotabi (2017b); Soltanolkotabi (2019b).
Definition 1 (Descent set and cone) The set of descent of a function R at a point x is defined as The cone of descent, or tangent cone, is the conic hull of the descent set, or the smallest closed cone C R (x) that contains the descent set, i.e.D R (x) ⊂ C R (x).
The size of the descent cone C R determines how well the regularizer R captures the structure of the unknown signal x.
The smaller the descent cone, the more precisely the regularizer describes the properties of the signal.We quantify the size of the descent cone using the notion of mean (Gaussian) width.
Definition 2 (Gaussian width) The Gaussian width of a set C ∈ R p is defined as: where the expectation is taken over g ∼ N (0, I p ).Throughout we use B n /S n−1 to denote the the unit ball/sphere of R n .
We now have all the definitions in place to quantify how well the function R captures the properties of the unknown signal x.This naturally leads us to the definition of the minimum required number of measurements.
Definition 3 (minimal number of measurements) Let C R (z) be a cone of descent of R at z.We define the minimal sample function as We shall often use the short hand m 0 = M(R, z) with the dependence on R, z implied.Here we define m 0 for an arbitrary point z, but we will apply the definition at the signal x.
We note that m 0 is exactly the minimum number of samples required for structured signal recovery from linear measurements when using convex regularizers.Specifically, the optimization problem succeeds at recovering the unknown signal x with high probability from m measurements of the form y i = a T i x if and only if m ≥ m 0 . 1 Given that in our Gaussian-approximated nonlinear CT reconstruction problem we have less information (we lose information when the input to the ReLU is negative), we cannot hope to recover structured signals from m ≤ m 0 when using (4.1).Therefore, we can use m 0 as a lower-bound on the minimum number of measurements required for projected gradient descent iterations eq. ( 4.2) to succeed in recovering the signal of interest.With these definitions in place we are now ready to state our theorem in the regularized/compressive sensing setting.We defer the proof of Theorem 2 to Appendix C. 1 We would like to note that m0 only approximately characterizes the minimum number of samples required.A more precise However, since our results have unspecified constants we avoid this more accurate characterization.
Theorem 2 Consider the problem of reconstructing a signal x ∈ R n from m nonlinear CT measurements of the form y i = 1 − e −(a T i x)+ , where the measurement vectors a i are generated i.i.d.N (0, I n ).We consider a constrained least-squares loss as in eq.(4.1) and run projected gradient updates of the form in eq.(4.2) starting from z 0 = 0 n with Here, erfc is the complementary error function.As long as the number of measurements obeys m 0 , with m 0 denoting the minimal number of samples per Definition 3, then holds with probability at least 1 Here, c 0 , c 1 , c 2 , and c 3 are fixed positive numerical constants.
Theorem 2 parallels Theorem 1, likewise showing fast geometric convergence to the global optimum despite nonconvexity.In this regularized setting, the sample complexity of our nonlinear reconstruction problem is on the order of m 0 , the number of measurements required for linear compressive sensing.In other words, the number of measurements required for regularized nonlinear CT reconstruction from raw measurements is within a constant factor of the number of measurements needed for the same reconstruction from linearized measurements.This is the optimal sample complexity for this nonlinear reconstruction task.For instance for an s sparse signal for which m 0 ∝ s log(n/s), the above theorem states that on the order of s log(n/s) nonlinear CT measurements suffices for our direct gradient-based approach to succeed.Finally, we would like emphasize that the above result is rather general as it applies to any type structure in the signal and can also deal with any convex regularizer.

EXPERIMENTS
We support our theoretical analysis with experimental evidence that gradient-based optimization through the nonlinear CT forward model is effective for a wide range of signal densities, including signals that are dense enough that the same optimization procedure through the linearized forward model produces noticeable "metal artifacts." All of our experiments are based on the JAX implementation of Plenoxels (Sara Fridovich-Keil and Alex Yu et al., 2022), with a dense 3D grid of optimizable density values connected by trilinear interpolation.We use a cone-beam CT setup and optimize with mild total variation regularization.Our experiments do not focus on speed or measurement sparsity, though we fully expect our optimization objective to pair naturally with efficient ray sampling implementations and regularizers of choice.

SYNTHETIC DATA
Our synthetic experiments use a ground truth volume defined by the standard Shepp-Logan phantom (Shepp & Logan, 1974) in 3D, with the following modifications: • We scale down the voxel density values by a factor of 4, to more closely mimic the values in our real cone-beam CT skull dataset.• We adjust one of the ellipsoids to be slightly larger than standard (to make it more visible), and gradually increase its ground truth density to simulate a spectrum from soft tissue to bone to metal.
We simulate CT observations of this synthetic volume and then reconstruct using either the linearized forward model with the logarithm and eq.(1.1), or directly using eq.(1.2).We also use a small amount of total variation regularization, and constrain results to be nonnegative.
Results of this synthetic experiment are presented in Figure 1.As the density of the test ellipsoid increases, the linearized reconstruction experiences increasingly severe "metal artifacts," while the nonlinear reconstruction continues to closely match the ground truth.PSNR values are reported over the entire reconstructed volume compared to the ground truth, where PSNR is defined as −10 log 10 (MSE) and MSE is the mean squared voxel-wise error.
Note that this synthetic experiment does not include any measurement noise or miscalibration; the instability of the logarithm with respect to dense signals arises even when the only noise is due to numerical precision.We also note that even the densest synthetic "metal" ellipsoid we test is no denser than what we observe in our real CBCT skull dataset in Figure 2, with a real metal dental crown.Nonlinear reconstruction is robust even to dense "metal" elements of the target signal.

REAL DATA
Our real data experiment uses a cone-beam CT phantom made from a human skull with metal dental crowns on some of the teeth.In Figure 2 we show slices of our nonlinear reconstruction compared to a reference commercial linearized reconstruction, as no ground truth is available for the real volume.The nonlinear reconstruction exhibits reduced metal-induced streak artifacts compared to the commercial reconstruction, highlighted in red and purple in the leftmost panel.
Figure 2: Real experiments using a 3D human skull phantom with a metal dental crown; here we show a cross-section of the reconstructed volume.Note the streak artifact to the left of the metal crown (annotated in red) and the X artifact below it (annotated in purple) in the reference linearized reconstruction, which are not present in the nonlinear reconstruction.

RELATED WORK
Tomographic reconstruction.The measurement model in eq. ( 1.2) is a discretized corollary of the Beer-Lambert Law that governs the attenuation of light as it passes through absorptive media.Inverting the exponential nonlinearity in this model recovers the Radon transform summarized by eq.(1.1), in which measurements are linear projections of the signal at chosen measurement angles.The Radon transform has a closed-form inverse transform, Filtered Back Projection (Radon, 1917;Kak & Slaney, 2001;Natterer, 2001), that leverages the Fourier slice theorem (Bracewell, 1990;Kak & Slaney, 2001).Filtered Back Projection is a well-understood algorithm that can be computed efficiently, and is a standard option in commercial CT scanners, but its reconstruction quality can suffer in the presence of either limited measurement angles or metal (highly absorptive) signal components (Fu et al., 2016).
Many methods exist to improve the quality of CT reconstruction in the limited-measurement regime, such as limited baseline tomography, which is of clinical interest because not all viewpoints may be accessible and every measurement angle requires exposing the patient to ionizing X-ray radiation.These methods typically involve augmenting the datafidelity loss function with a regularization term that describes some prior knowledge of the signal to be reconstructed.Such priors include sparsity (implemented through an ℓ 1 norm) in a chosen basis, such as wavelets (Foucart & Rauhuti, 2013;Chambolle et al., 1998), as well as gradient sparsity (implemented through total variation regularization) (Candes et al., 2006).Compressive sensing theory guarantees correct recovery with fewer measurements in these settings, as long as the true signal is well-described by the chosen prior (Foucart & Rauhuti, 2013).CT reconstruction with priors cannot be solved in closed form, but as long as the regularization is convex we are guaranteed that iterative optimization methods such as gradient descent, ISTA (Chambolle et al., 1998), and FISTA (Beck & Teboulle, 2009) will be successful.
Recently, reconstruction with even fewer measurements has been proposed by leveraging deep learning, through either neural scene representation (Rückert et al., 2022) or data-driven priors (Szczykutowicz et al., 2022).These methods may sacrifice convexity, and theoretical guarantees, in favor of more flexible and adaptive regularization that empirically reduces reconstruction artifacts in the limited-measurement regime.However, these methods are still based on the linear measurement model of eq.(1.1), making them susceptible to reconstruction artifacts near highly absorptive metal components.In some cases neural methods may reduce metal artifacts compared to traditional algorithms, but this reduction is achieved by leveraging strong and adaptive prior knowledge rather than the measurements of the present signal.
We propose to resolve these metal artifacts by reconstructing from raw nonlinear X-ray absorption measurements, rather than the preprocessed, linearized measurements produced by standard CT scanners.Our method may pair particularly well with new photon-counting CT scanners (Shikhaliev et al., 2005), which were approved by the FDA in 2021 (Food & Administration, 2021).These scanners measure raw X-ray photon counts, which should enable finer-grained noise modeling and correction as well as our method for principled reconstruction of signals with metal.
Signal reconstruction from nonlinear measurements.There are a growing number of papers focused on reconstructing a signal from nonlinear measurements or single index models.Early papers on this topic focus on phase retrieval and ReLU nonlinearities (Oymak et al., 2018;Soltanolkotabi, 2017;Candes et al., 2015) and approximate reconstruction (Oymak & Soltanolkotabi, 2017a).These papers do not handle the compressive sensing/structured signal reconstruction setting.The paper (Soltanolkotabi, 2019a) deals with reconstruction from structured signals for intensity and absolute value nonlinearties but only achieves the optimal sample complexity locally.A more recent paper (Mei et al., 2018) deals with a variety of nonlinearities with bounded derivative activations.However, this paper does not handle non-differentiable activations and only deals with simple structured signals such as sparse ones.In contrast to the above paper, our activation is non-differentiable, we handle arbitrary structures in the signal, and our results apply for any convex regularizer.

DISCUSSION
In this paper, we consider the CT reconstruction problem from raw nonlinear measurements of the form for a signal x and random measurement weights a i .Although this nonlinear measurement model can be easily transformed into a linear model via a logarithmic preprocessing step ŷi = − ln(1 − y i ) = a T i x, and this transformation is common practice in clinical CT reconstruction, the logarithm is numerically unstable when the measurements approach unity.This occurs frequently in practice, notably when the signal x contains metal and especially for low-dose CT scanners that reduce radiation exposure.In this setting, traditional linear reconstruction methods tend to produce "metal artifacts" such as streaks around metal implants.Reconstruction directly through the raw nonlinear measurements avoids this numerically unstable preprocessing, in exchange for solving a nonconvex nonlinear least squares objective instead of convex linear least squares.
We prove that gradient descent finds the global optimum in CT reconstruction from raw nonlinear measurements, recovering exactly the true signal x despite the nonconvex optimization.Moreover, it converges at a geometric rate, which is considered fast even for convex optimization.This nonconvex optimization requires order n measurements, where n is the dimension of the unknown signal, the same order sample complexity as if we had reconstructed through a linear forward model.
We also extend our theoretical results to the compressive sensing setting, in which prior structural knowledge of the signal x, enforced through a regularizer, allows for reconstruction with far fewer measurements than the dimension of the signal.Our results in this setting again parallel standard results from the linear reconstruction problem, even though we consider a nonlinear forward model and optimize a nonconvex formulation.
We also compare linearized and nonlinear CT reconstruction experimentally in the setting of 3D cone-beam CT, using both a synthetic 3D Shepp-Logan phantom for which we know the ground truth volume as well as a real human skull phantom with metal dental crowns.In both cases, we find that nonlinear reconstruction reduces metal artifacts compared to linearized reconstruction, whether that linearized reconstruction is done by gradient descent or a commercial algorithm.
Our work is a promising first step towards higher-quality CT reconstruction in the presence of metal components and low-dose X-rays, offering both practical and theoretical guidance for trustworthy reconstruction.Future work may extend our results both theoretically and experimentally to consider more realistic measurement noise settings such as Poisson noise, which is particularly timely given the emergence of new photon-counting CT scanners.

A APPENDIX B PROOF OF THEOREM 1
In the following subsections we prove Theorem 1, beginning with a proof outline.

B.1 PROOF OUTLINE
The proof consists of the following four steps.
Step I: First iteration: Welcome to the neighborhood.In the first step, we show that as long as the number of measurements is sufficiently large (m ≥ 5n), the first iteration obeys with probability at least 1 − 2e − m 2 − e −cn , for a constant c.We prove this in Appendix B.2.
Step II: Local pseudoconvexity.In the second step, we show that our nonconvex objective function is locally strongly pseudoconvex inside this local neighborhood of eq.(B.1).Specifically, we show the correlation inequality ) , with probability at least 1 − 4e −n as long as the number of measurements m is at least Step III: Smoothness.In the third step, we show holds with probability at least 1 − e − 1 2 (m+n) , for a constant L. This smoothness condition is proved in Appendix B.5.
Step IV: Completing the proof via combining Steps I-III.Finally, we combine the first three steps into a complete proof.At the core of the proof is the following lower bound on the correlation between the loss gradient and the error vector for positive constants A and B. This starting point is similar to the proof of Lemma 7.10 in Candes et al. (2015), with some modifications necessary for the nonlinear CT reconstruction problem.To prove eq. (B.4) we first combine eq.(B.2) and eq.(B.3) to conclude that Thus eq.(B.4) holds with A = α 2 and B = C for a new constant C, with probability at least 1 − 4e −n − e − 1 2 (m+n) − 2e − m 2 − e −cn , for a constant c (by a union bound over the first three steps of the proof).Using eq.(B.4) with adequate choice of stepsize suffices to prove geometric convergence.
. In (a) we expand the square.In (b) we apply eq.(B.4).In (c) we choose µ t+1 ∈ (0, 2B], making the second term negative.Applying this relation inductively over T steps of gradient descent yields geometric convergence with rate 1 − 2µA, provided that the step size µ t is less than 2B for t > 1.
The number of measurements m required in theorem 1 is the maximum over the number of samples required for each of the first two steps of the proof.For the first step, m ≥ 5n measurements are sufficient to reach our neighborhood of radius 1 4 ∥x∥ ℓ2 .For case 1 in the second step, m ≥ c1n α 2 ∥x∥ 2 ℓ 2 measurements are sufficient for correlation concentration.
For case 2 in the second step, m ≥ c2n ( √ 2α− √ α) 2 measurements are sufficient for concentration.Maximizing over these bounds yields the result in theorem 1 (note that the constants change during the maximization).

B.2 FIRST STEP: WELCOME TO THE NEIGHBORHOOD
We first consider what happens in expectation when we take our first gradient step starting from an initialization at z 0 = 0.The expectation is over the randomness in the Gaussian measurement vector a.We have In (a) we evaluate f (0) = 0 and f ′ (0) = 1 2 , where for the latter we use 1 2 as the sub-differential even though f is nondifferentiable at 0 due to the non-differentiability of ReLU (this choice is also justified as it is the expected gradient of f around a small random initialization around 0).In (b) we plug in the value of the measurement y.In (c) we use linearity of expectation and evaluate the first term, E a [a] = 0 n .In (d) we separate the leading a into components parallel and orthogonal to x, and evaluate the expectation of the orthogonal term to zero.In (e) we replace a T x with g ∥x∥ ℓ2 for a scalar Gaussian g, as these have the same distribution.In (f) we evaluate the remaining expectation.In (g) so that in expectation, our first step exactly recovers the signal x.
Because in practice we do not have access to an expectation over infinite measurements, we also care about the concentration of this first gradient step.This first-step concentration determines the local neighborhood around the signal x that our gradient descent will operate within for the remaining iterations.
In (a) we separate a i into components parallel and orthogonal to x.In (b) we replace a T i x with ∥x∥ ℓ2 g i for a standard scalar Gaussian g i , as these have the same distribution.In (c) we simplify the second term by rewriting it with a standard Gaussian vector a ⊥ with n − 1 free dimensions (constrained to be orthogonal to x), and a ⊥ independent of g i for all i.In (d) we write ĝi := g i (1 − exp(−(g i ) + ∥x∥ ℓ2 )) and gi := (1 − exp(−(g i ) + ∥x∥ ℓ2 )) 2 .
Note that the first term has expectation µ1 4 exp x = x computed above, and the second term has expectation zero because gi and a ⊥ are independent, and a ⊥ is mean zero.The first term is aligned with the signal x but with a scaling factor µ 2m∥x∥ ℓ 2 m i=1 ĝi ; we can bound the deviation of this scaling from its mean using Hoeffding's concentration bound for sums of sub-Gaussian random variables.We use the definition of sub-Gaussianity provided by Definition 2.2 in Wainwright (2019), for which ĝi has sub-Gaussian parameter 1.We omit the leading constant , and apply the Hoeffding bound as presented in Proposition 2.5 in Wainwright (2019) to conclude that m i=1 The second term is a nonnegative scaling µ 2m m i=1 gi times a standard n-dimensional Gaussian vector a ⊥ with n − 1 degrees of freedom (because it is orthogonal to x).Since gi is bounded in [0, 1], we can upper bound this scaling by µ 2 √ m .Then the norm of the random vector a ⊥ can also be bounded using Exercise 5.2.4 in Vershynin (2018), to conclude that with probability at least 1 − e −ct 2 n for a constant c, where in (a) we evaluate the expectation of a Gaussian norm with n − 1 degrees of freedom and in (b) we upper bound n − 1 by n and do a change of variables to replace t with t √ n.
Putting these together with a union bound, we have that , where in (a) we change variables and replace s with sm, and c is a constant.If we choose s = t = 1, this simplifies to with probability at least 1 − 2e − m 2 − e −cn , for a constant c.For our first step to lie within a distance of 1 4 ∥x∥ ℓ2 , we need the number of measurements to satisfy The denominator in this expression is lower bounded by 0.2 for all ∥x∥ ℓ2 , so we can also guarantee the first step concentration to this neighborhood using m ≥ 5n measurements.

B.3 CORRELATION CONCENTRATION: CASE 1
Consider the correlation where h := z − x.Now note that if we have a T i x ≥ 0 and a T i h ≥ 0 it implies a T i x + a T i h = a T i z ≥ 0. Thus we have Using the above we can conclude that In (a) we plug in the indicator inequality, and accordingly remove the now-superfluous ReLU.In (b) we use the h notation, and regroup terms.To continue, we divide both sides by ∥h∥ 2 ℓ2 and use the notation ĥ = h e −∥h∥ ℓ 2 a T i ĥ 1 − e −∥h∥ ℓ 2 a T i ĥ ∥h∥ ℓ2 a T i ĥ.
To continue note that the function g(x, s) = e −sx (1 − e −sx ) s has non-positive derivative as ∂g ∂s = e −2sx (2sx − e sx (sx + 1) + 1) for all values of s and x.This implies that g(x, s) is a non-increasing function of s.Thus, we can conclude that i x e −ra T i ĥ 1 − e −ra T i ĥ a T i ĥ.
Thus we can focus on lower bounding i x e −r(a T i ĥ)+ 1 − e −r(a T i ĥ)+ (a T i ĥ) + over the set ĥ ∈ S n−1 : x T ĥ where we reintroduce superfluous ReLUs around a T i ĥ as it will be convenient in the next steps.To continue, note that the function f (h) = e −rh (1 − e −rh )h has derivative It is easy to verify numerically that for h ≥ 0 this gradient is maximized around h max ≈ 0.402673 r , with maximum value f ′ (h max ) ≈ 0.312334.Thus, for all h ≥ 0 As a result the function g(h) = e −h+ (1 − e −h+ )h + is a 1 3 -Lipschitz function of h.Thus for any h 1 ∈ R d and h 2 ∈ R d we have We now define the random variable X i ( ĥ) = 1 {a T i x≥0} e −2a T i x e −(a T i ĥ)+ 1 − e −(a T i ĥ)+ (a T i ĥ) + and note that the Lipschitzness of g implies that for any h 1 , h 2 ∈ R d we have ) is a sub-Gaussian random variable with sub-Gaussian norm on the order of ∥h 2 − h 1 ∥ ℓ2 , we have that for some constant c.We use c to denote any universal constant; note that this constant may vary between different lines.Thus using the centering rule for sub-Gaussian random variables, the centered processes Xi (h) : Using the rotational invariance property of sub-Gaussian random variables, this implies that the stochastic process has sub-Gaussian increments.That is, Thus using Exercise 8.6.5 of Vershynin (2018) we can conclude that holds with probability at least 1 − 2e −u 2 .Thus, we conclude that for all h obeying ∥h∥ ℓ2 ≤ r we have with probability at least 1 − 2e −n , for a constant c.
We can estimate and lower bound the expectation above using a numerical average over many (50000) two-dimensional Gaussian samples, with the two dimensions corresponding to a T ĥ and a T x ∥x∥ ℓ 2 , minimizing over all correlations between x and h at least ρ (i.e.all correlations in this case).We arrive at the following lower bound, for r = 1 4 ∥x∥ ℓ2 and ρ = −0.6.
This bound is illustrated in a "proof by picture" in Figure 4.

B.4 CORRELATION CONCENTRATION: CASE 2
In this case we will focus on controlling the correlation inequality in the region where h ∈ R n : ∥h∥ ℓ2 ≤ r and x T h ∥x∥ ℓ2 ∥h∥ ℓ2 ≤ ρ .
Consider the correlation In (a) we provide a lower bound by introducing an additional indicator function on a T i x, which allows us to remove the ReLU.In (b) we use the notation h := z − x, and combine terms.In (c) we again provide a lower bound by adding an indicator to restrict a T i h ≤ 0. By flipping the sign of h we can alternatively lower bound To this aim note that for s ≥ 0 we have e s ≥ 1 and (e s − 1) s ≥ s 2 .Thus, To continue, we introduce the notation ĥ = h ∥h∥ ℓ 2 , and divide both sides by ∥h∥ 2 ℓ2 .Thus we have Thus it suffices to lower bound e −2a T i x (a T i ĥ) 2 over ĥ ∈ S n−1 : x T ĥ ∥x∥ ℓ2 ≥ −ρ .
To continue using Jensen's inequality we have where we have defined the function which is a 1-Lipschitz function of v.We now define the random variable X i ( ĥ) = 1 {a T i x≥0} S a T i ĥ; and note that the Lipschitzness of S implies that for any h 1 , h 2 ∈ R d we have ) is a sub-Gaussian random variable with sub-Gaussian norm on the order of ∥h 2 − h 1 ∥ ℓ2 , we have that for some constant c.We use c to denote any universal constant; note that this constant may vary between different lines.Using the centering rule for sub-Gaussian random variables, the centered processes Xi (h) := Using the rotational invariance property of sub-Gaussian random variables, this implies that the stochastic process Thus using Exercise 8.6.5 of Vershynin (2018) we can conclude that holds with probability at least 1 − 2e −u 2 .In the last line we used the fact that the surface area of a spherical cap with distance at least ϵ away from the center is bounded by e −n ϵ 2 2 .By using u = √ n, this implies that holds with probability at least 1 − 2e −n , for a constant c.
We can estimate and lower bound the expectation above using a numerical average over many (50000) two-dimensional Gaussian samples, with the two dimensions corresponding to a T ĥ and a T x ∥x∥ ℓ 2 , minimizing over all correlations between x and h at most ρ (i.e.all correlations in this case).We arrive at the following lower bound, for r = 1 4 ∥x∥ ℓ2 and ρ = −0.6.
This bound is illustrated in a "proof by picture" in Figure 5.

B.5 BOUNDING THE GRADIENT NORM
Consider the gradient and note that where S n−1 denotes the set of all real n-dimensional unit-norm vectors.To continue, we use the Cauchy-Schwarz inequality: where ∥A∥ is the operator norm of a matrix comprised by stacking the vectors a i , and in (a) we used the fact that the function f (z) = e −(z)+ is 1-Lipschitz.Finally, using the fact that ∥A∥ ≤ 2( √ m + √ n) with probability at least 1 − e −0.5(m+n) , we conclude that with probability at least 1 − e −0.5 (m+n) , where L is a constant since we have the number of measurements at least a constant times the number of unknowns.This completes the proof of smoothness of the gradient towards the global optimum.
C PROOF OF THEOREM 2 The general strategy of the proof is similar to Theorem 1 but requires delicate modifications in each step.Concretely, we have the following four steps.
Step I: First iteration: Welcome to the neighborhood.
In the first step we show that the first iteration obeys with high probability as long as m ≥ cm 0 .We prove this in subsection C.1.
In this step we prove that the loss function is locally strongly pseudoconvex.Specifically we show that for all z ∈ K that also belong to the local neighborhoood N Here, the value of α is the same as in Theorem 1 (see Appendix B.1).We prove this in subsection C.2 by again considering two cases.
Step III: Local smoothness.
We also use the fact that the loss function is locally smooth, that is, holds with probability at least 1 − e −0.5(m+n) per Section B.5.
Step IV: Completing the proof via combining steps I-III.
In this step we show how to combine the previous steps to complete the proof of the theorem.First, note that by the first step the first iteration will belong to the local neighborhood N (x) and thus belongs to the set K ∩ N (x where in the last line we used the fact that µ t+1 ≤ α L 2 , completing the proof.

C.1 PROOF OF STEP I
The beginning of this proof is the same as the unregularized version where we note that The denominator in this expression is lower bounded by 0.2 for all ∥x∥ ℓ2 , so we can also guarantee the first step concentration to this neighborhood using m ≥ 5m 0 measurements.

C.2 PROOF OF STEP II
The proof of this step is virtually identical to that of the unregularized case.The only difference is that when we apply Exercise 8.6.5 of Vershynin (2018) n is replaced with m 0 (indeed this exercise is stated with m 0 ).As a result in the two cases we conclude Case I: In this case using the above yields ∇L(z) T (z − x) ≥ 1 r E 1 {a T x≥0} 1 {a T ĥ≥0} e −2a T x e −ra T i ĥ 1 − e −ra T i ĥ a T i ĥ − c r √ m 0 √ m holds with probability at least 1 − 2e −m0 , for a constant c.
Case II: In this case using the above yields ∇L(z) T (z − x) ≥ E 1 {a T x≥0} S a T ĥ; holds with probability at least 1 − 2e −m0 , for a constant c.
Thus the remainder of the proof is identical and the only needed change in this entire step is to replace n with m 0 .

Figure 1 :
Figure 1: Synthetic experiments using the Shepp-Logan phantom, showing a slice through the reconstructed 3D volume.From top to bottom, we increase the density of the central test ellipsoid to simulate soft tissue, bone, and metal.Nonlinear reconstruction is robust even to dense "metal" elements of the target signal.

Figure 3 :
Figure 3: Case 2 provides a lower bound on the expected correlation for both cases.
and c 2 .We compute the α in Equation (B.2) in two cases, where case 1 corresponds to x T (z−x) ∥x∥ ℓ 2 ∥z−x∥ ℓ 2 ≥ −0.6 and case 2 corresponds to x T (z−x) ∥x∥ ℓ 2 ∥z−x∥ ℓ 2 < −0.6.These cases are analyzed in Appendix B.3 and Appendix B.4, respectively.Combining cases 1 and 2, we have that the lower bound on α from case 2 lower bounds the bound from case 1, as shown in Figure 3, so we use that bound in eq.(B.2).The sample complexity in eq.(B.2) is the maximum over the sample complexities of cases 1 and 2, which are m ≥ , respectively, for constants c and C.

Figure 4 :
Figure 4: Lower bound on the expected correlation in case 1.

Figure 5 :
Figure 5: Lower bound on the expected correlation in case 2.